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Abstract 

Assigning significance in high-dimensional regression is challenging. Most computa- 
tionally efficient selection algorithms cannot guard against inclusion of noise variables. 
Asymptotically valid p-values are not available. An exception is a recent proposal 



by Wasserman and Roeder (20081 which splits the data into two parts. The number 
of variables is then reduced to a manageable size using the first split, while classical 
variable selection techniques can be applied to the remaining variables, using the data 
from the second split. This yields asymptotic error control under minimal conditions. 
It involves, however, a one-time random split of the data. Results are sensitive to this 
arbitrary choice: it amounts to a 'p-value lottery' and makes it difficult to reproduce 
results. Here, we show that inference across multiple random splits can be aggregated, 
while keeping asymptotic control over the inclusion of noise variables. We show that 
the resulting p-values can be used for control of both family-wise error (FWER) and 
false discovery rate (FDR). In addition, the proposed aggregation is shown to improve 
power while reducing the number of falsely selected variables substantially. 

Keywords: High-dimensional variable selection, Data splitting, Multiple comparisons, Family- 
wise error rate, False discovery rate. 



1 Introduction 



The problem of high-dimensional variable selection has received tremendous attention in the 



last decade. Sparse estimators like the Lasso (Tibshirani, 1996) and extensions thereof (Zou 
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2006 Meinshausen, 2007) have been shown to be very powerful because they are suitable for 



high-dimensional data sets and because they lead to sparse, interpretable results. 

In the usual work-flow for high-dimensional variable selection problems, the user sets poten- 
tial tuning parameters to their prediction optimal values and uses the resulting estimator as 
the final result. In the classical low-dimensional setup, some error control based on p- values 
is a widely used standard in all areas of sciences. So far, p-values were not available in 



high-dimensional situations, except for the proposal of Wasserman and Roeder (2008). An 



ad-hoc solution for assigning relevance is to use the bootstrap to analyze the stability of the 
selected predictors and to focus on those which are selected most often (or even always). 



Bach (2008) and Meinshausen and Buhlmann (2008) show for the Lasso that this leads to 



a consistent model selection procedure under fewer restrictions than for the non-bootstrap 
case. 



More recently, some progress has been achieved to obtain error control (Wasserman and 



Roeder, 2008 Meinshausen and Buhlmann, 2008). Here, we build upon the approach of 



Wasserman and Roeder (2008) and show that an extension of their 'screen and clean' algo- 



rithm leads to a more powerful variable selection procedure. Moreover, family-wise error rate 



(FWER) and false discovery rate (FDR) can be controlled, while Wasserman and Roeder 



(2008) focus on variable selection rather than assigning significance via p-values. We also 



extend methodology to control of the false discovery rate (Benjamini and Hochberg, 1995) 
for high-dimensional data. 

While the main application of the procedure are high-dimensional data, where the number 
p of variables can greatly exceed sample size n, we show that the method is also quite 
competitive with more standard error control for n > p settings, indeed often giving a better 
detection power in the presence of highly correlated variables. 



This article is organized as follows. We discuss the single-split method of Wasserman and 
Roeder (2008) briefly in Section |5J showing that results can strongly depend on the ar- 
bitrary choice of a random sample splitting. We propose a multi-split method, removing 
this dependence. In Section [3] we prove FWER and FDR-control of the multi-split method, 
and we show in Section [4] numerically for simulated and real data-sets that the method is 
more powerful than the single-split version while reducing substantially the number of false 
discoveries. Some possible extensions of the proposed methodology are outlined in Section [5} 



2 



2 Sample Splitting and High-Dimensional Variable Se- 
lection 



We consider the usual high-dimensional linear regression setup with a response vector Y = 
(Y\, . . . , Y n ) and an n x p fixed design matrix X such that 

Y = X(3 + e, 

where e = {e\,...e n ) is a random error vector with £j iid. M(0,a 2 ) and (3 G W is the 
parameter vector. Extensions to other models are outlined in Section [5j 

Denote by 

S = {j; ± 0} 

the set of active predictors and similarly by iV = S c = {j; f3j = 0} the set of noise variables. 
Our goal is to assign p-values for the null-hypotheses H j : /3j — versus Haj '■ Pj ^ 
and to infer the set 5* from a set of n observations (Xi,Yi), i = l,...,n. We allow 
for potentially high-dimensional designs, i.e. p>n. This makes statistical inference very 



challenging. An approach proposed by Wasserman and Roeder (2008) is to split the data 



into two parts, reducing the dimensionality of predictors on one part to a manageable size 
of predictors (keeping the important variables with high probability), and then to assign 
p-values and making a final selection on the second part of the data, using classical least 
squares estimation. 



2.1 FWER control with the Single-Split Method 



The procedure of Wasserman and Roeder (2008) attempts to control the family- wise error 



rate (FWER), which is defined as the probability of making at least one false rejection. The 
method relies on sample-splitting, performing variable selection and dimensionality reduction 
on one part of the data and classical significance testing on the remaining part. The data are 
splitted randomly into two disjoint groups D in = (X in , Y in ) and D out = (X out , Y out ) of equal 
size. Let S be a variable selection or screening procedure which estimates the set of active 
predictors. Abusing notation slightly, we also denote by S the set of selected predictors. 
Then variable selection and dimensionality reduction is based on D in , i.e. we apply S only 
on D in . This includes the selection of potential tuning parameters involved in S. The idea is 
to break down the large number p of potential predictor variables to a smaller number k <^ip 
with k at most a fraction of n while keeping all relevant variables. The regression coefficients 
and the corresponding p-values P\ , . . . , P p of the selected predictors are determined based 
on by using ordinary least squares estimation on the set S and setting Pa = 1 for all 
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j £ S. If the selected model S contains the true model S, i.e. S D S, the p-values based 
on D out are unbiased. Finally, each p- value Pj is adjusted by a factor \S\ to correct for the 
multiplicity of the testing problem. 

The selected model is given by all variables in S for which the adjusted p- value is below a 
cutoff a e (0, 1), 

S aingle = {jeS:P j \S\<a}. 

Under suitable assumptions discussed later, this yields asymptotic control against inclusion 
of variables in N (false positives) in the sense that 



limsupP[|iVnS , 8i „pie| > 1 



< a, 



i.e. control of the family-wise error rate. The method is easy to implement and yields the 
asymptotic control under weak assumptions. The single-split method relies, however, on 
an arbitrary split into D in and D out . Results can change drastically if this split is chosen 
differently. This in itself is unsatisfactory since results are not reproducible. 



2.2 FWER control with the New Multi-Split Method 

An obvious alternative to a single arbitrary split is to divide the sample repeatedly. For 
each split we end up with a set of p-values. It is not obvious, though, how to combine and 
aggregate the results. 

In the remainder of the section, we will give a possible answer. For each hypothesis, a 
distribution of p-values is obtained for random sample splitting. We will propose that error 
control can be based on the quantiles of this distribution. We will show empirically that, 
maybe unsurprisingly, the resulting procedure is more powerful than the single-split method. 
The multi-split method also makes results reproducible, at least approximately if the number 
of random splits is chosen to be very large. 

The multi-split method uses the following procedure: 

For b = 1,. . . ,B: 

1. Randomly split the original data into two disjoint groups and D^ t of equal size. 

2. Using only D^, estimate the set of active predictors S^. 

3. (a) Using only D^ t , fit the selected variables in with ordinary least squares and 

calculate the corresponding p-values Pf^ for j e S^ b \ 
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(b) Set the remaining p-values to 1, i.e. 



4. Define the adjusted (non- aggregated) p-values as 

pf =min(pf j = l,...,p (2.1) 

Finally, aggregate over the B p-values Pf \ as discussed below. 

The procedure leads to a total of B p-values for each predictor j — 1, . . . , p. It will turn out 
that suitable summary statistics are quantiles. For 7 e (0, 1) define 

g,( 7 ) = min {l, q y ({Pj b) / T , b= 1,. . . ,B})}, (2.2) 
where ? 7 (-) is the (empirical) 7-quantile function. 

A p- value for each predictor j = 1, ... ,p is then given by Qj(j), for any fixed < 7 < 1. We 
will show in Section [3] that this is an asymptotically correct p- value, adjusted for multiplicity. 
To give an example, for a choice of 7 = 0.5, the quantity Qj(0.5) is twice the median of all 
p-values Pj b \ b = 1, . . . , B. 

A proper selection of 7 may be difficult. Error control is not guaranteed anymore if we 
search for the best value of 7. We propose to use instead an adaptive version which selects 
a suitable value of the quantile based on the data. Let 7 m i n G (0, 1) be a lower bound for 7, 
typically 0.05, and define 

Pj = min {l,(l- log 7^) mf Qj(l)} (2.3) 

7e(7min,l) J 



The extra correction factor 1 — log7 m ; n ensures that the family- wise error rate remains 
controlled at level a despite of the adaptive search for the best quantile, see Section [3j 
For the recommended choice of 7 min = 0.05, this factor is upper bounded by 4; in fact, 
1 - log(0.05) « 3.996. 

We comment briefly on the relation between the proposed adjustment to false discovery 



rate ( 


Benjamini and Hochberg, 


1995 


(Holm 


. 


1979 


) controlling procedures. 



While we provide a family-wise error control and 
as such use union bound corrections as in Holm ( 1979[ ), the definition of the adjusted p- 
values (2.3) and its graphical representation in Figure [l] are vaguely reminiscent of the false 
discovery rate procedure, rejecting hypotheses if and only if the empirical distribution of 



p-values crosses a certain linear bound. The empirical distribution in (2.3) is only taken 
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Figure 1: Left: a histogram of adjusted p-values for the selected variable in the motif 
regression data example of Section \4.3 . The single split method picks randomly one of these 
p-values (a l p-value lottery') and rejects if it is below a. For the multi-split method, we reject 
if and only if the empirical distribution function of the adjusted p-values crosses the broken 
line (which is f{p) = max{0.05, (3.996/a)p}) for some p G (0, 1). This bound is shown as 
a broken line for a = 0.05. For the given example, the bound is indeed exceeded and the 
variable is thus selected. 
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for one predictor variable, though, which is either in S or N. This would correspond to a 
multiple testing situation where we are testing a single hypothesis with multiple statistics. 



Figure [T] shows an example. The left panel contains the histogram of the adjustec 



pf for b — 1, . . . , B of the selected variable in the real data example in Section 



p- values 



4.3 



The 



single split method is equivalent to picking one of these p-values randomly and selecting 
the variable if this randomly picked p-value is sufficiently small. To avoid this 'p-value 
lottery', the multi-split method computes the empirical distribution of all p-values Pf^ for 
b = 1, . . . , B and rejects if the empirical distribution crosses the broken line in the right 
panel of Figure [TJ A short derivation of the latter is as follows. Variable j is selected if 
and only if Pj < a, which happens if and only if there exists some 7 G (0.05, 1) such that 



Qj(l) < a /(l — log 0.05) ~ a/3. 996. Equivalently, using definition (2.2), the 7-quantile of 
the adjusted p-values, g 7 (Pj fc ' ) ), has to be smaller than or equal to cry/3. 996. This in turn 
is equivalent to the event that the empirical distribution of the adjusted p-values P^ for 
b — 1, . . . , B is crossing above the bound f{p) = max{0.05, (3.996/a)p} for some p G (0, 1). 
This bound is shown as a broken line in the right panel of Figure [T] 

The resulting adjusted p-values Pj, j — 1, . . . , p can then be used for both FWER and FDR 
control. For FWER control at level a G (0, 1), simply all p-values below a are rejected and 
the selected subset is 

Smuiu = {j : Pj < a}. (2.4) 
that indeed, asymptotically, F(V > 0) < a, where V = \S mu iunN\ 



We will show in Section 



3.2 



is the number of falsely selected variables under the proposed selection (2.4). Besides better 
reproducibility and asymptotic family-wise error control, the multi-split version is, maybe 
unsurprisingly, more powerful than the single-split selection method. 



2.3 FDR control with the multi-split method 



Control of the family- wise error rate is often considered as too conservative. If many rejec- 



tions are made, Benjamini and Hochberg (1995) proposed to control instead the expected 
proportion of false rejections, the false discovery rate (FDR). Let V = \Sf) N\ be the number 
of false rejections for a selection method S and R = \S\ the total number of rejections. The 
false discovery rate is defined as the expected proportion of false rejections 

E(Q), with Q = V/max{l,P}. (2.5) 

For no rejections, R = 0, the denominator ensures that the false discovery proportion Q is 



0, conforming with the definition in Benjamini and Hochberg (1995). 



The original FDR controlling procedure in (Benjamini and Hochberg, 1995) first orders the 
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observed p-values as P(i) < P( 2 ) < 



< P( p ) and defines 



k = max{i : P(j) < -q}. 



V 



(2.6) 



Then all variables or hypotheses with the smallest k values are rejected and no rejection is 



made if the set in (2.6) is empty. FDR is controlled this way at level q under the condition 



that all p-values are independent. It has been shown in Benjamini and Yekutieli (2001) that 



the procedure is conservative under a wider range of dependencies between p-values; see 



also Blanchard and Roquain (2008 ) for related work. It would, however, require a big leap 



of faith to assume any such assumption for our setting of high- dimensional regression. For 



general dependencies, Benjamini and Yekutieli (2001) showed that control is guaranteed at 

-l 



level q 



g(l/2 + log(p)). 



The standard FDR procedure is working with the raw p-values, which are assumed to be uni- 



formly distributed on [0, 1] for true null hypotheses. The division by p in (2.6) is an effective 
correction for multiplicity. The proposed multi-split method, however, is producing already 



adjusted p-values, as in (2.3). Since we are working already with multiplicity-corrected p- 



values, the division by p in (2.6) turns out to be superfluous. Instead, we can order the 
corrected p-values Pj, j = 1, . . . ,p in increasing order P( X ) < P( 2 ) < 
h variables with the smallest p-values, where 

h = max{i : P^ < iq}. 



< P( p ) and select the 



(2.7) 



The selected set of variables is denoted, with the value of h given in (2.7), by 

{j : Pj < P (fc )}, 



S. 



multi;FDR 



(2.8) 



with no rejections, S mu i U -FDR = 0, if P(i) > iq for all z = 1 



p. 



The procedure (2.8) will achieve FDR control at level qYll 



get FDR control at level q, we replace q in (2.7) by q/(Y^i=i 



1 w g(l/2 + logp). To 
completely analogous to 



the standard FDR-procedure under arbitrary dependence of the p-values in Benjamini and 
Yekutieli (2001). We will prove error control in the following section and show empirically 



the advantages of the proposed multi-split version over both the single-split and standard 
FDR controlling procedures in the later section with numerical results. 



3 Error Control and Consistency 



3.1 Assumptions 



To achieve asymptotic error control, a few assumptions are made in Wasserman and Roeder 



(2008) regarding the crucial requirements for the variable selection procedure S. 
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(Al) Screening property: lim r 



P 



SD S 



1. 



(A2) Sparsity property: \S\ < n/2. 

The screening property (A[T]) ensures that all relevant variables are retained. Irrelevant noise 
variables are allowed to be selected, too, as long as there are not too many as required by 
the sparsity property (A[2]). A violation of the sparsity property would make it impossible to 
apply classical tests on the retained variables. 



The Lasso (Tibshirani, 1996) is an important example which satisfies (A[T]) and (A[2]) un- 



der appropriate conditions discussed in Meinshausen and Buhlmann (2006), Zhao and Yu 



(2006), van de Geer (2008), Meinshausen and Yu (2009) and Bickel et al. (2008). The 



adaptive Lasso (Zou, 2006 Zhang and Huang, 2008) satisfies (A[T| and (W2J as well under 
suitable conditions. Other examples include, assuming appropriate conditions, L 2 Boosting 



(Friedman, 2001 Buhlmann, 2006), orthogonal matching pursuit (Tropp and Gilbert, 2007) 



or Sure Independence Screening (Fan and Lv, 2008). 



We will typically use the Lasso (and extensions thereof) as screening method. Other algo- 
rithms would be possible. Wasserman and Roeder ( 2008[ ) studied various scenarios under 
which these two properties are satisfied for the Lasso, depending on the choice of the regu- 
larization parameter. We refrain from repeating these and similar arguments, just working 
on the assumption that we have a selection procedure S at hand which satisfies both the 
screening property and the sparsity property. 



3.2 FWER control 



We proposed two versions for multiplicity-adjusted p- values. One is Qjd) as defined in (2.2) 
which relies on a choice of 7 £ (0, 1). The second is the adaptive version Pj defined in (2.3) 
which makes an adaptive choice of 7. We show that both quantities are multiplicity-adjusted 
p-values providing asymptotic FWER-error control. 

Theorem 3.1. Assume (AUty and (A^. Let a, 7 £ (0, 1). If the null-hypothesis H j : j3j — 
gets rejected whenever Qj(j) < ct, the family-wise error rate is asymptotically controlled at 
level a, i.e. 



limsup P 



mxnQJj) < a 



< a, 



where P is with respect to the data sample and the statement holds for any of the B random 
sample splits. 

A proof is given in the appendix. 
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Theorem 3.1 is valid for any pre-defined value of the quantile 7. However, the adjusted 
p- values Qj{"f) involve the somehow arbitrary choice of 7 which might pose a problem for 
practical applications. We therefore proposed the adjusted p-values Pj which search for the 
optimal value of 7 adaptively. 



Theorem 3.2. Assume (AUty and (A^. Let a G (0, 1). // the null-hypothesis H j : (3j = 
gets rejected whenever Pj < a, the family-wise error rate is asymptotically controlled at level 
a, i.e. 



limsup P 



minP, < a 



< a, 



where the probability P is as in Theorem 3.1 



A proof is given in the appendix. 

A brief remark regarding the asymptotic nature of the results seems in order. The proposed 
error control relies on all truly important variables being selected in the screening stage with 
very high probability. This is our screening property (A[l]). Let A be the event S C S. The 
results above for example in Theorem 3.2| can be formulated in a non-asymptotic way as 



P[*4. H { mux,- S 7v Pj < a}] < a, and P(A) —>■ 1, typically exponentially fast, for n — > 00. 



Analogous remarks apply to Theorem |3 . 1 1 and 3.3 below. 



3.3 FDR control 



The adjusted p-values can be used for FDR control, as laid out in Section 2.3 The set 
of selected variables S mu iti-FDR was defined in (2.8). Here, we show that FDR is indeed 



controlled at the desired rate with this procedure. 

Theorem 3.3. Assum e (A v^ and (A^. Let q G (0,1). Let S mu iu-FDR be the set of selected 
variables, as defined in \2.8 ) and V = \S mu i t i-FDR^N\ and R = \S mu i t y,FDR\- The false discov- 
ery rate (2.5) with Q = V/ max{l,i?} is then asymptotically controlled at level aY^^i' 1 , 
i.e. 



1 

limsup E(Q) < q 2^ -■ 



A proof is given in the appendix. 

As with FWER-control, we could be using, for any fixed value of 7, the values Qj{^), 
j = 1, . . . , p instead of Pj, j — 1, . . . , n. 



We refrain from giving the full details since, in our 



experience, the adaptive version above works reliably and does not require an a-priori choice 
of the quantile 7 that is necessary otherwise. 
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3.4 Model Selection Consistency 



If we let level a = a n — > for n — > oo, the probability of falsely including a noise variable 
vanishes because of the preceding results. In order to get the property of consistent model 
selection, we have to analyze the asymptotic behavior of the power. It turns out that this 
property is inherited from the single-split method. 



Corollary 3.1. Let S S i ng i e be the selected model of the single-split method. Assume that 
a n — > can be chosen for n — > oo at a rate such that lim^oo F[S S i ng i e = S] = 1. Then, for 
any 7 m ; n (see (2.3)), the multi-split method is also model selection consistent for a suitable 
sequence a n , i.e. for S mu i ti = {j G S;Pj < a n } it holds that 



lim P 

n^oo 



s, 



multi 



s 



1. 



Wasserman and Roeder 



1 for various variable se 
scheme. 



(2008) discuss conditions which ensure that lim^oo F[S, 



single 



ection methods such as the Lasso or some forward variable selection 



The reverse of the Corollary above is not necessarily true. The multi-split method can be 
consistent if the single-split method is not. A necessary condition for consistency of the 
single-split method is limsup^^ P[pj^ < a] — 1 for all j G S, where the probability is 
with respect to both the data and the random split-point, as there is a positive probability 
otherwise that variable j will not be selected with the single-split approach. For the multi- 
split method, on the other hand, we only need a bound on quantiles of over b = 1, . . . , B. 
We refrain from going into more details here and rather show with numerical results that the 
multi-split method is indeed more powerful than the single-split analogue. We also remark 



that the Bonferroni correction in (2.1 ), multiplying the raw p- values with the number \S^ \ of 



selected variables, could possibly be improved upon by using ideas in Hothorn et al. (2008), 
further improving the power of the procedure. 



4 Numerical Results 

In this section we compare the empirical performance of the different estimators on simulated 
and real data sets. Simulated data allow a thorough evaluation of the model selection 
properties. The real data set shows that we can find signals in data with our proposed 
method that would not be picked up by the single-split method. We use a default value of 
a = 0.05 everywhere. 
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4.1 Simulations 



We use the following simulation settings: 

(A) Simulated data set with n = 100, p = 100 and a design matrix coming from a centered 
multivariate normal distribution with covariance structure Cov(Xj, X^) = p^~ k \ with 
p = 0.5. 

(B) As @ but with n = 100 and p = 1000. 

(C) Real data set with n = 71 and p = 4088 for the design matrix X and artificial re- 
sponse Y. 



The data set in (|Cj) is from gene expression measurements in Bacillus Subtilis. The p = 4088 
predictor variables are log-transformed gene expressions and there is a response measuring 
the logarithm of the production rate of riboflavin in Bacillus Subtilis. The data is kindly 
provided by DSM (Switzerland). As the true variables are not known, we consider a linear 
model with design matrix from real data and simulating a sparse parameter vector (3 as 
follows. In each simulation run, a new parameter vector f3 is created by either 'uniform' or 
'varying-strength' sampling. Under 'uniform' sampling, \S\ randomly chosen components of 
(3 are set to 1 and the remaining p— \ S\ components to 0. Under 'varying-strength' sampling, 
\S\ randomly chosen components of (3 are set to values 1, . . . , J *S' | . The error variance a 2 is 
adjusted such that the signal to noise ratio (SNR) is maintained at a desired level at each 
simulation run. We perform 50 simulations for each setting. 

The sample-splitting is done such that the model is trained on a data set of size |_(n — l)/2j 
and the p- values are calculated on the remaining data set. This slightly unbalanced scheme 
prevents us from situations where the full model might be selected on the first data set. 
Calculations of p-values would not be possible on the remaining data in such a situation. 



We use a total of B = 50 sample-splits for each simulation run. As in Wasserman and Roeder 



(2008), we compute p-values for all procedures using a normal approximation. Results are 



qualitatively similar when using a t-distribution instead. 

We compare the average number of true positives and the family-wise error rate (FWER) 
for the single- and multi-split methods for all three simulation settings (A)-(C) and vary in 
each the SNR to 0.25, 1, 4 and 16 (which corresponds to population R 2 values of 0.2, 0.5, 
0.8 and 0.94, respectively). The number \S\ of relevant variables is either 5 or 10. As initial 
variable selection or screening method S we use three approaches, which are all based on 



the Lasso (Tibshirani, 1996). The first one, denoted by Sfi xe d, uses the Lasso and selects 
those \n/6\ variables which appear most often in the regularization path when varying the 
penalty parameter. The constant number of \n/Q\ variables is chosen, somewhat arbitrarily, 
to ensure a reasonably large set of selected coefficients on the one hand and to ensure, on 
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the other hand, that least squares estimation will work reasonably well on the second half 
of the data with sample size \n/2\. While the choice seems to work well in practice and can 
be implemented very easily and efficiently, it is still slightly arbitrary Avoiding any such 
choices of non-data adaptive tuning parameters, the second method, S cv , uses the Lasso 
with penalty parameter chosen by 10-fold cross-validation and selecting the variables whose 
corresponding estimated regression coefficients are different from zero. The third method, 
S a dap, is the adaptive Lasso of 
on 10-fold cross-validation wit 



Zou 



(2006) where regularization parameters are chosen based 
;h the Lasso solution used as initial estimator for the adaptive 
Lasso. The selected variables are again the ones whose corresponding estimated regression 
parameters are different from zero. 

Results are shown in Figures [2] and [3] for both the single-split method and the multi-split 
method with the default setting 7 min = 0.05. Using the multi-split method, the average 
number of true positives (the variables in S which are selected) is typically slightly increased 
while the FWER (the probability of including variables in N) is reduced sharply. The 
single-split method has often a FWER above the level a = 0.05 at which it is asymptotically 
controlled while for the multi-split method the FWER is above the nominal level only in few 
scenarios. The asymptotic control seems to give a good control in finite sample settings with 
the multi-split method, maybe apart from the method Sfi xe d on the very high-dimensional 
dataset (C). The single-split method, in contrast, selects in nearly all settings too many 
noise variables, exceeding the desired FWER sometimes substantially. This suggests that 
the asymptotic error control seems to work better for finite sample sizes for the multi- 
split method. Even though the multi-split method is more conservative than the single-split 
method (having substantially lower FWER), the number of true discoveries is often increased. 
We note that for data ([C]), with p = 4088, and in general for low SNR, the number of true 
positives is low since we control the very stringent family-wise error criterion at a = 0.05 
significance level. As an alternative, controlling less conservative error measures is possible 
and is discussed in Section HJ 

We also experimented with using the value of Qjij) directly as an adjusted p- value, without 
the adaptive choice of 7 but using a fixed value 7 = 0.5 instead, i.e. looking at twice the 
median value of all p-values across multiple data splits, as suggested in a different context 



by van de Wiel et al. (2009). The results were not as convincing as for the adaptive choice 



and we recommend the adaptive version with 7 min = 0.05 as a good default choice. 



4.2 Comparisons with adaptive Lasso 



Next, we compare the multi-split selector with the adaptive Lasso (Zou, 2006). We have 



used the adaptive Lasso previously as a variable selection method in our proposed multi- 
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Figure 2: Simulation results for setting ( p4| ) in the top and in the bottom row. Average 
number of true positives vs. the family-wise error rate (FWER) for the single split method 
( C S') against the multi-split version ('M'). FWER is controlled (asymptotically) at a = 0.05 
for both methods and this value is indicated by a broken vertical line. From left to right are 
results for Sfi xe d, S cv and S a da P - Results of a unique setting of SNR, sparsity and design are 
joined by a line, which is solid if the coefficients follow the 'uniform 7 sampling and broken 
otherwise. Increasing SNR is indicated by increasing symbol size. 
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Figure 3: The results of simulation setup (C). 
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split method. The adaptive Lasso is usually employed on its own. There are a few choices 
to make when using the adaptive Lasso. We use the same choices as previously. The initial 
estimator is obtained as the Lasso solution with a 10-fold CV-choice of the penalty parameter. 
The adaptive Lasso penalty is also obtained by 10-fold CV. 



Despite desirable asymptotic consistency properties (Huang et al. , 2008), the adaptive Lasso 
does not offer error control in the same way as Theorem 3.1 does for the multi-split method. 
In fact, the FWER (the probability of selecting at least one noise variable) is very close to 1 
with the adaptive Lasso in all the simulations we have seen. In contrast, our multi-split 
method offers asymptotic control, which was seen to be very well matched by the empirical 
FWER in the vicinity of a = 0.05. Table [T] shows the simulation results for the multi-split 
method using S a da P and the adaptive Lasso on its own side by side for a simulation setting 
with n = 100, p = 200 and the same settings as in (A) and (B) otherwise. The adaptive 
Lasso selects roughly 20 noise variables (out of p = 200 variables), even though the number 
of truly relevant variables is just 5 or 10. The average number of false positives is at most 
0.04 and often simply with the proposed multi-split method. 
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Table 1: Comparing the multi-split method with CV-Lasso selection, S a dap, with the selection 
made when using the adaptive Lasso and a CV-choice of the involved penalty parameters for 
a setting with n = 100 and p = 200. 
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There is clearly a price to pay for controlling the family-wise error rate. Our proposed multi- 
split method detects on average less truly relevant variables than the adaptive Lasso. For 
very low SNR, the difference is most pronounced. The multi-split method selects in general 
neither correct nor wrong variables for SNR = 0.25, while the adaptive Lasso averages 
between 2 to 3 correct selections, among 9-12 wrong selections. Depending on the objectives 
of the study, one would prefer either of the outcomes. For larger SNR, the multi-split method 
detects almost as many truly important variables as the adaptive Lasso, while still reducing 
the number of falsely selected variables from 20 or above to roughly 0. 

The multi-split method seems hence beneficial in settings where the cost of making an 
erroneous selection is rather high. For example, expensive follow-up experiments are usually 
required to validate results in bio-medical applications and a stricter error control will place 
more of the available resources into experiments which are likely to be successful. 



4.3 Motif regression 



We apply the multi-split method to a real data set about motif regression (Conlon et al. 



2003). For a total of n — 287 DNA segments we have the binding intensity of a protein 
to each of the segments. These will be our response values Yi, . . . , Y n . Moreover, for p = 
195 candidate words ('motifs') we have scores Xij which measure how well the jth motif is 
represented in the ith DNA sequence. The motifs are typically 5-15bp long candidates for 
the true binding site of the protein. The hope is that the true binding site is in the list 
of significant variables showing the strongest relationship between the motif score and the 
binding intensity. Using a linear model with S a da P , the multi-split method identifies one 
predictor variable at the 5% significance level. The single-split method is not able to identify 
a single significant predictor. In view of the asymptotic error control and the empirical 
results in Section [4] there is substantial evidence that the selected variable corresponds to a 
true binding site. For this specific application it seems desirable to pursue a conservative 
approach with low FWER. As mentioned above, we could control other, less conservative 
error measures as discussed in Section [5] 



4.4 Comparison with standard low-dimensional FDR control 



We mentioned that control of FDR can be an attractive alternative to FWER if we expect 
a sizable number of rejections. Using the corrected p- values Pi, . 



, P p , a simple FDR- 
controlling procedure was derived in Section 2.3 and its asymptotic control of FDR was 



shown in Theorem |3.3| We now look empirically at the behavior of the resulting method 
and its power to detect truly interesting variables. Turning again to the simulation setting 
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Figure 4: The results of FDR controlling simulations for the multi-split method (dark bar) 
and standard FDR control (light bar). The settings of n,p,p, \S\ and SNR are given below 
each simulation. The height of the bars corresponds to the average number of selected im- 
portant variables. For p > n, the standard method breaks down and the corresponding bars 
are set to height 0. 
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(jAp, we vary the sample size n, the number of variables p, the signal to noise ratio SNR, the 
correlation p between neighboring variables and the number s of truly interesting variables. 

It was shown already above extensively that the multi-split method is preferable to the single- 
split method. Here, we are more interested in comparison to well understood traditional FDR 
controlling procedures. For p < n, the standard approach would be to compute the least 
squares estimator once for the full dataset. For each variable a p- value is obtained and the 



FDR controlling procedure as in (2.6) can be applied. This approach obviously breaks down 
for p > n. Our proposed approach can be applied both to low- dimensional (p < n) and 
high-dimensional (p > n) settings. 

In all settings, the empirical FDR of our method (not shown) is below q = 0.05 and often 
close to zero. Results regarding power are shown in Figure [4] for control at q — 0.05. 

It is maybe unexpected, but the multi-split method tracks the power of the standard FDR 
controlling procedure quite closely for low-dimensional data p < n. In fact, the multi-split 
method is doing considerably better if n/p is below, say, 1.5 or the correlation among the tests 
is large. An intuitive explanation for this behavior is that, as p approaches n, the variance 
in each estimated coefficient vector under the OLS estimate is increasing substantially. This 
in turn increases the variance of all OLS components fy, j = l,...,p and reduces the 
ability to select the truly important variables. The multi-split method, in contrast, trims 
the total number of variables to a substantially smaller number on one half of the samples 
and suffers then less from an increased variance in the estimated coefficients on the second 
half of the samples. Repeating this over multiple splits leads thus to a surprisingly powerful 
variable selection procedure even for low-dimensional data. Nevertheless, we think that the 
main application will be high-dimensional data, where the standard approach breaks down 
completely. 



5 Extensions 

Due to the generic nature of our proposed methodology, extensions to any situation where 
(asymptotically valid) p- values Pj for hypotheses H j (J = l,...,p) are available are 
straightforward. An important class of examples are generalized linear models (GLMs) 
or Gaussian Graphical Models. The dimensionality reduction step would typically involve 
some form of shrinkage estimation. An example for Gaussian Graphical Models would be 



the recently proposed 'Graphical Lasso' (Friedman et al. 2008). The second step would rely 



on classical (e.g. likelihood ratio) tests applied to the selected submodel, analogous to the 
methodology proposed for linear regression. 

In some settings, control of FWER at, say, a = 0.05 is too conservative. One can either 
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resort to control of FDR, as alluded to above. Alternatively, FWER control can easily be 
adjusted to control the expected number of false rejections. Take as an example the adjusted 
p-value Pj, defined in (2.3). Variable j is rejected if and only if Pj < a. (For the following, 
assume that adjusted p- values, as defined in (2.1), are not capped at 1. This is a technical 
detail only as it does not modify the proposed FWER controlling procedure.) Rejecting 
variable j if and only if Pj < a controls FWER at level a. Instead, one can reject variables 
if and only if Pj/K < a, where K > 1 is a correction factor. Call the number of falsely 
rejected variables V, 

V = J2 HPj/K < a}. 

Then the expected number of false positives is controlled at level limsup^^ K[V] < aK. A 
proof of this follows directly from the proof of Theorem 3^ Of course, we can equivalently 
set k = aK and obtain a control limsup,^^ E[V] < k. For example, setting k — 1 offers a 
much less conservative error control, if so desired, than control of the family-wise error rate. 



6 Discussion 

We proposed a multi-sample-split method for assigning statistical significance and construct- 
ing conservative p-values for hypothesis testing for high-dimensional problems where the 
number of predictor variables may be much larger than sample size. Our method is an ex- 



tension of the single-split approach of Wasserman and Roeder (2008) and is extended to false 



discovery rate (FDR) control. Combining the results of multiple data-splits, based on quan- 
tiles as summary statistics, improves reproducibility compared to the single-split method. 
The multi-split method shares with the single-split method the property of asymptotic error 
control and model selection consistency. We argue empirically that the multi-split method 
usually selects much fewer false positives than the single-split method while the number of 
true positives is slightly increased. The main area of application will be high-dimensional 
data, where the number p of predictor variables exceeds sample size n, as standard ap- 
proaches rely on least-squares estimation and thus fail in this setting. It was, however, 
shown that the method is also an interesting alternative to standard FDR and FWER con- 
trol in lower-dimensional settings as the proposed FDR control can be more powerful if p is 
reasonably large but smaller than sample size n. The method is very generic and can be used 
for a broad spectrum of error controlling procedures in multiple testing, including linear and 
generalized linear models. 
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A Proofs 



Proof of Theorem 3. 1 , For technical reasons we define 



K (P) = pW^s c S {b) } + l{3 £ S {b) }. 



(A.9) 



are the adjusted p-values if the estimated active set contains the true active set. Oth- 
erwise, all p-values are set to 1. Because of assumpfou (A§ aud for fixed B, P[/vf> 



P) u > for aU 6 = 1,...,B] on a set v4 n with P[v4 n ] — > 1. Therefore, we can define all the 
quantities involving also with , and it is sufficient to show under this slightly altered 
procedure that 

P[minQ 7 -(7) < a] < a. 
In particular we can omit here the limes superior. 

We also omit for the proofs the function min{l,-} from the definitions of Qji'j) and Pj in 
(2.2 ) and (2.3 ) respectively. The selected sets of variables are clearly unaffected and notation 



is simplifies considerably. 

Define for u € (0, 1) the quantity 7ij(u) as the fraction of bootstrap samples that yield Kj 



(6) 



less than or equal to u, 



*i(«) = ^Ei{#i w <«}. 

6=1 



Note that the events {Qj(j) < «} and {^-(cry) > 7} are equivalent. Hence, 
P minQ,(7) < a] < ^E^iW < <*}} = E^ 7 ^) > 7} 



(A.10) 



Using a Markov inequality, 

J2®[l{^H) >7}] <^E E K-(«7)]- 



iev 



jev 



By definition of 7Tj('), 

1 

7 



E E M«7)] = ^E E El{Af <a 7 } 



B 



75 



6=1 jeNnsW 



Moreover, using the definition of Kf in (|A.9|) 



E 



l{Af } < « 7 } 



< P 



Pf ] <a 1 \S<Z 



Q!7 
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This is a consequence of the uniform distribution of P^ given S C . Summarizing these 
results we get 

- - B 



p 



minQ 1 (7) < a 



<~E4 E 



75 



6=1 jewnsO) 



< a, 



which completes the proof. 



□ 



Proof of Theorem 3.2 . As in the proof of Theorem 
Analogously, instead of P^ we work with Kj ■ 



3.1 



we will work with instead of P^ . 



For any Kf ] with j G iV and a e (0, 1), 



E 



l{K®<ar,} 



7 



< a. 



(A.11; 



Furthermore, 



E 



< « 7 } 



max 

jew 



7 



r l{Af } <«7} 

<E 2^ — 



7 



and hence, with (A. 11) and using the definition rtA 9 of K) ' , 



(6) 



7 



E 



1« } < « 7 } 



max 

j<=N 



7 



< 



4 E 



a 



< a. 



(A.12) 



For a random variable U taking values in [0, 1] 

\{U < «7} 



sup 

7e(7min,l) 



U>a, 
a/U a7 m i n < U < a, 

l/7min U <C tt7min- 



Moreover, if U has a uniform distribution on [0, 1], 
1{U < 07}- 



E 



sup 

7e(7min,l) 7 



Hence, by using that Kj has a uniform distribution on [0, 1] for all j G N, conditional on 

S C 



Q7n 



7mi L n^ + / ax x dx = a(l - log7 min ). 



"7min 



E 



sup 

7e(7min,l) 



l{^ 6) <a 7 } 



7 



< E 



sup 

7e(7min,l) 



l{kf ] < a 7 } 



S c £ (6) 



7 



a(l - log7 n 
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Analogously to (A. 12), we can then deduce that 

l{/^ 6) <« 7 } 



j(Z TV L 7e(7 m in,l) 



7 



< a(l - log7 n 



Averaging over all bootstrap samples yields 

^Ef=il«V7<«} 



E E L f up r 



7 



< a(l - log7 n 



Using again a Markov inequality, 



E | sup l{iTj(aj) > 7} < q(1 — log7 n 

j e JV 76(7min,l) 



where we have used the same definition for tt^- ( ■ ) as in the proof of Theorem 3.1 



Since the events {Qj(j) < «} and {^-(cry) > 7} are equivalent, it follows that 



rrr. L 7 e(7mm,i) 



a 



< a(l - log7 n 



implying that 



Vp inf g,( 7 )(i-iog 7min ) < 

z — L7e(7 min ,l) 



a 



< a. 



Using the definition of Pj in (2.3) 



ie/v 



and thus, by the union bound, 
which completes the proof. 



P 



minP,- < a 



< a, 



< a, 



(A.13) 



□ 



Proof of Theorem 3.3. We use identical notation to the proof of Theorem 1.3 in Benjamini 



and Yekutieli (2001). An exception is that we use the value q instead of q/m in the FDR- 
controlling procedure since we are working with adjusted p-values. Let 



p ijk = F{{P i e[{j-l)q,jq\}a,ndC i 



(On 

k )■> 



where Cu is the event that if variable i were rejected, then k — 1 other variables were also 



rejected. Now, as shown in equation (10) and then again in (28) in Benjamini and Yekutieli 



(2001) 



P -y k 



k 

ieN k=l j=i 
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Using this result, we use in the beginning a similar argument to Benjamini and Yekutieli 

pool]), 



v v 



E (Q) = EE*X>* = EEEi^ 
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i£N k=l j=l 
V V 



i£N j=l k=j 
V i P 
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E E E ^ E E 7 E^ fc = E 7 E E^ fc 

j=l J i£N k=l 



iGN 3=1 k=j 



J 



] 

i£N j=l J k=l 



Let us denote 



ieN k=l 



The last equation (A. 14) can then be rewritten as 



hq) < E7/(i) = /(i)+E7(E/o")-E/o": 



3=2 3 j'=l 
i P 



3'=1 



3=1 J J j/=i ^3'=i 



Note that, in analogy to (27) in Benjamini and Yekutieli (2001), 

v v 

Ew* = p ({ p e [(j - i)qjq}} n (IJcf)) = p ( p e [0' - 



fc=i 



and hence 



p 

= EE*** = E P ( P G [0 - 

ieN fe=i igJV 



from which it follows by (A. 13) in the proof of Theorem 3.2 that 



(A. 14) 



(A.15) 
(A.16) 



E/(j') = E p ( Pi ^'«) 

3'=1 iEN 



Using this in (A. 17), we obtain 



e(q) < E^ _ ^tt)^ + -^ = (E^^tV + 1 )<7 = <?E- 



~i 3 3 + 

3 = 1 

which completes the proof. 



(A.17) 

□ 
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Proof of Corollary 3.1. Because the single-split method is model selection consistent, it must 
hold that Pjmaxjgs Pj | S | < a n ] — > 1 for n — > oo. Using multiple data-splits, this prop- 
erty holds for each of the B splits and hence P[max,,' g s maxj, < a n ] — > 1, which 
implies that, with probability converging to 1 for n — > oo, the quantile max Jg s 
is bounded from above by a n . The maximum over all j £ S of the adjusted p- values 
= (1 - log7 min ) inf 7e(7minil) Qjij) is thus bounded from above by (1 - log7 min )a n , again 
with probability converging to 1 for n — > oo. □ 
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